Significance test
TODO (reading list):
What's in a p-value in NLP?
Yeh (2000)
Rayson (2003)
Berg-Kirkpatrick et al. (2012)
Nivre (2001)
Goutte and Gaussier (2005)

Stats in ACL 2018: only 1/5 papers do it right: https://twitter.com/catherinehavasi/status/1019030669828112384?s=21

Hitchhiker's guide: Dror et al. (2018)

Usage

The stratified shuffling-based randomization test (Yeh 2000) seems very common.
Lee et al. (2015) use "two-sided bootstrap resampling statistical significance tests (Graham et al., 2014)".
Bugert et al. (2017), Zhou et al. (2015), and Nivre et al. (2009) use McNemar's test.
Some papers use Koehn's bootstrap resampling procedure (Koehn 2004).
Zapirain et al. (2013): "we checked for statistical significance using bootstrap resampling (100 samples) coupled with one-tailed paired t-test (Noreen 1989)."
Bengtson & Roth (2008): "paired non-parametric bootstrapping percentile test".
Batchkarov et al. (2016) use "bootstrapping" to estimate variance and later hint at statistical significance.
Gerber and Chai (2012): "We used a bootstrap resampling technique similar to those developed by Efron and Tibshirani (1993) to test the significance of the performance difference between various systems."

Random variation

TODO:
Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging
Deep Reinforcement Learning that Matters (https://arxiv.org/abs/1709.06560)

References

Batchkarov et al. (2016). ACL 2016. http://sro.sussex.ac.uk/62044/1/acl2016.pdf
Bengtson, E., & Roth, D. (2008). Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08, 51(October), 294. http://doi.org/10.3115/1613715.1613756
Berg-Kirkpatrick, T., Burkett, D., & Klein, D. (2012). An empirical investigation of statistical significance in NLP. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics.
Bugert, M., Puzikov, Y., Andreas, R., Eckle-Kohler, J., Martin, T., & Mart, E. (2017). LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test. The 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics (LSDSem 2017), 56–61.
Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker's guide to testing statistical significance in natural language processing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1383–1392.
Gerber, M. S., & Chai, J. Y. (2012). Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38(4), 755–798. http://doi.org/10.1162/COLI_a_00110
Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. European Conference on Information Retrieval. Springer, Berlin, Heidelberg.
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 4, 388–395. http://doi.org/10.1145/2063576.2063688
Lee, K., Artzi, Y., Choi, Y., & Zettlemoyer, L. (2015). Event Detection and Factuality Assessment with Non-Expert Supervision. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1643–1648.
Nivre, J. (2001). On statistical methods in natural language processing. Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001).
Nivre, J., Kuhlmann, M., & Hall, J. (2009). An Improved Oracle for Dependency Parsing with Online Reordering. Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), 73–76. Paris, France: Association for Computational Linguistics.
Rayson, P. (2003). Matrix: A statistical method and software tool for linguistic analysis through corpus comparison (Doctoral dissertation). Lancaster University.
Yeh, A. (2000). More accurate tests for the statistical significance of result differences. Proceedings of the 18th Conference on Computational Linguistics, Volume 2. Association for Computational Linguistics.
Zapirain, B., Agirre, E., Màrquez, L., & Surdeanu, M. (2013). Selectional Preferences for Semantic Role Classification. Computational Linguistics, 39(3).
Zhou, M., Frank, A., Friedrich, A., & Palmer, A. (2015). Semantically Enriched Models for Modal Sense Classification. Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), p. 44.
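
Example sketch

To make two of the tests named under Usage concrete, here is a minimal Python sketch of a paired randomization test and an exact McNemar test. It is an illustration under stated assumptions, not the procedure of any cited paper: the function names `approximate_randomization` and `mcnemar_exact` are hypothetical, both systems are assumed to be scored on the same test items, and the randomization test assumes the metric is a mean of per-item scores (such as accuracy). For corpus-level metrics such as F-score, the metric would have to be recomputed on each shuffled set of outputs instead.

```python
import math
import random


def approximate_randomization(scores_a, scores_b, trials=9999, seed=0):
    """Two-sided paired randomization test on per-item scores.

    Assumes the metric is the mean of per-item scores, so comparing
    summed differences is equivalent to comparing metric differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(trials):
        # Swapping the two systems' outputs on an item flips the sign
        # of that item's score difference.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against float rounding in the comparison.
        if abs(shuffled) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact (binomial) McNemar test on per-item correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    # Only discordant items (exactly one system correct) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Under the null, discordant items split 50/50 between the systems.
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical per-item correctness (1 = correct) for two systems.
    system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
    system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
    print("randomization p =", approximate_randomization(system_a, system_b))
    print("McNemar p =", mcnemar_exact(system_a, system_b))
```

Both functions take per-item results rather than aggregate scores, because the tests are paired: they depend on which items each system gets right, not only on how many.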